Abstract

In this paper, we consider supervised learning problems such as logistic regression and study
the stochastic gradient method with averaging, in the usual stochastic approximation setting
where observations are used only once. We show that for self-concordant loss functions, after $n$
iterations, with a constant step-size proportional to $1/(R^2\sqrt{n})$, where $n$ is the number of
observations and $R$ is the maximum norm of the observations, the convergence rate is always of order
$O(1/\sqrt{n})$, and improves to $O(R^2/(\mu n))$, where $\mu$ is the lowest eigenvalue of the Hessian at the
global optimum (when this eigenvalue is strictly positive). Since $\mu$ does not need to be known
in advance, this shows that averaged stochastic gradient is adaptive to unknown local strong
convexity of the objective function.



1 Introduction 

The minimization of an objective function which is only available through unbiased estimates of the
function values or its gradients is a key methodological problem in many disciplines. Its analysis
has been attacked mainly in three communities: stochastic approximation, optimization, and
machine learning [11, 12, 13, 14]. The main algorithms which have emerged
are stochastic gradient descent (a.k.a. Robbins-Monro algorithm), as well as a simple modification
where iterates are averaged (a.k.a. Polyak-Ruppert averaging).

The convergence rates of these algorithms depend primarily on the potential strong convexity of
the objective function. For $\mu$-strongly convex functions, the optimal rate of
convergence of function values is $O(1/n)$, while for convex functions the optimal rate is $O(1/\sqrt{n})$
[20, 21]. For smooth functions, averaged stochastic gradient with step sizes proportional to $1/\sqrt{n}$
achieves them up to logarithmic terms [16].

Convex optimization problems coming from supervised machine learning are typically of the form
$f(\theta) = \mathbb{E}[\ell(y, \langle\theta, x\rangle)]$, where $\ell(y, \langle\theta, x\rangle)$ is the loss between the response $y$ and the
prediction $\langle\theta, x\rangle$. They may or may not have strongly convex objective functions. This most often
depends on (a) the correlations between covariates $x$, and (b) the strong convexity of the loss function
$\ell$. The logistic loss $u \mapsto \log(1 + e^{-u})$ is not strongly convex unless restricted to a compact set; moreover,
in the sequential observation model, the correlations are not known at training time. Therefore,
many theoretical results based on strong convexity do not apply. The goal of this paper is
to show that with proper assumptions, namely self-concordance, one can readily obtain favorable
theoretical guarantees for logistic regression, namely a rate of the form $O(R^2/(\mu n))$, where $\mu$ is the
lowest eigenvalue of the Hessian at the global optimum, without any exponentially increasing constant
factor.

Another goal of this paper is to design an algorithm and provide an analysis that benefit from
hidden local strong convexity without requiring knowledge of the local strong convexity constant in
advance. In smooth situations, the results of [16] imply that the averaged stochastic gradient
method with step sizes of the form $O(n^{-1/2})$ is adaptive to the strong convexity of the problem.
However, the dependence on $\mu$ in the strongly convex case is of the form $O(\mu^{-2} n^{-1})$, which is
sub-optimal. Moreover, the final rate is rather complicated, notably because all possible step-sizes are
considered. Finally, it does not apply here because even in low-correlation settings, the objective
function of logistic regression cannot be globally strongly convex.

In this paper, we provide an analysis for stochastic gradient with averaging for generalized linear
models such as logistic regression, with a step size proportional to $(R^2\sqrt{n})^{-1}$, where $R$ is the radius of
the data and $n$ the number of observations, showing such adaptivity. In particular, we show that the
algorithm can adapt to the local strong-convexity constant, i.e., the lowest eigenvalue of the Hessian
at the optimum. The analysis is done for a finite horizon $N$ and a constant step size decreasing
in $N$ as $1/(R^2\sqrt{N})$, since the analysis is then slightly easier, though (a) a decaying step size could
be considered as well, and (b) it could be classically extended to varying step-sizes by a doubling
trick [17].

2 Stochastic approximation for generalized linear models 

In this section, we present the assumptions our work relies on, as well as related work.

2.1 Assumptions

Throughout this paper, we consider a function $f$ defined on
a Hilbert space $\mathcal{H}$, and an increasing family of $\sigma$-fields $(\mathcal{F}_n)_{n \geq 1}$; we assume that we are given a
deterministic $\theta_0 \in \mathcal{H}$, and a sequence of functions $f_n : \mathcal{H} \to \mathbb{R}$, for $n \geq 1$. We make the following
assumptions:

(A1) Convexity and differentiability of $f$: $f$ is convex and three-times differentiable.

(A2) Generalized self-concordance of $f$ [18]: for all $\theta_1, \theta_2 \in \mathcal{H}$, the function $\varphi : t \mapsto f[\theta_1 +
t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in \mathbb{R}$, $|\varphi'''(t)| \leq R\|\theta_1 - \theta_2\|\varphi''(t)$.

(A3) Attained global minimum: $f$ has a global minimum attained at $\theta_* \in \mathcal{H}$.

(A4) Lipschitz-continuity of $f_n$ and $f$: all gradients of $f$ and $f_n$ are bounded by $R$, that is, for
all $\theta \in \mathcal{H}$, $\|f'(\theta)\| \leq R$ and $\forall n \geq 1$, $\|f_n'(\theta)\| \leq R$ almost surely.

(A5) Adapted measurability: $\forall n \geq 1$, $f_n$ is $\mathcal{F}_n$-measurable.

(A6) Unbiased gradients: $\forall n \geq 1$, $\mathbb{E}(f_n'(\theta_{n-1}) \mid \mathcal{F}_{n-1}) = f'(\theta_{n-1})$.

(A7) Stochastic gradient recursion: $\forall n \geq 1$, $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$.

Among the seven assumptions above, the non-standard one is (A2): the notion of self-concordance
is an important tool for convex optimization and in particular the study of Newton's method. It
corresponds to having the third derivative bounded by the $\frac{3}{2}$-th power of the second derivative. For
machine learning, [18] has generalized the notion of self-concordance by removing the $\frac{3}{2}$-th power,
so that it is applicable to cost functions arising from probabilistic modeling, as shown below.

Our set of assumptions corresponds to the following examples (with i.i.d. data, and $\mathcal{F}_n$ equal to the
$\sigma$-field generated by $x_1, y_1, \dots, x_n, y_n$):

- Logistic regression: $f_n(\theta) = \log(1 + \exp(-y_n\langle x_n, \theta\rangle))$, with data $x_n$ uniformly almost surely
bounded by $R$ and $y_n \in \{-1, 1\}$. Note that this includes other binary classification losses,
such as $f_n(\theta) = -y_n\langle x_n, \theta\rangle + \sqrt{1 + \langle x_n, \theta\rangle^2}$.

- Generalized linear models with uniformly bounded features: $f_n(\theta) = -\langle\theta, \Phi(x_n, y_n)\rangle +
\log \int h(y) \exp(\langle\theta, \Phi(x_n, y)\rangle)\,dy$, with $\Phi(x_n, y)$ almost surely bounded in norm by $R$, for all
observations $x_n$ and all potential responses $y$. This includes multinomial regression and
conditional random fields [19].

- Robust regression: we may use $f_n(\theta) = \varphi(y_n - \langle x_n, \theta\rangle)$, with $\varphi(t) = \log\cosh t = \log\frac{e^t + e^{-t}}{2}$,
with a similar boundedness assumption on $x_n$.
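
As a sanity check of (A2) for the logistic example, the sketch below evaluates the second and third derivatives of the scalar loss $u \mapsto \log(1 + e^{-u})$ on a grid and verifies that $|\ell'''(u)| \leq \ell''(u)$. Along a line $\theta_1 + t(\theta_2 - \theta_1)$, the chain rule then yields the factor $|\langle x_n, \theta_1 - \theta_2\rangle| \leq R\|\theta_1 - \theta_2\|$. The function names and the grid are illustrative assumptions, not part of the paper.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def logistic_loss_derivatives(u):
    """Second and third derivatives of u -> log(1 + exp(-u))."""
    s = sigmoid(u)
    second = s * (1.0 - s)            # l''(u) = sigma(u) (1 - sigma(u))
    third = second * (1.0 - 2.0 * s)  # l'''(u) = l''(u) (1 - 2 sigma(u))
    return second, third


def worst_ratio(grid):
    """Largest value of |l'''(u)| / l''(u) over the grid."""
    worst = 0.0
    for u in grid:
        second, third = logistic_loss_derivatives(u)
        worst = max(worst, abs(third) / second)
    return worst


# Hypothetical grid on [-30, 30]; the ratio equals |1 - 2 sigma(u)| < 1.
grid = [k / 10.0 for k in range(-300, 301)]
ratio = worst_ratio(grid)
```

The ratio approaches 1 for large $|u|$ but never exceeds it, which is the scalar form of the generalized self-concordance inequality.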



2.2 Related work 

Non-strongly-convex functions. When only convexity of the objective function is assumed,
then several authors [14, 15] have shown that using a step-size proportional to $1/\sqrt{n}$, together
with some form of averaging, leads to the minimax optimal rate of $O(1/\sqrt{n})$ [20, 21]. Without
averaging, the known convergence rates are suboptimal, that is, averaging is key to obtaining the
optimal rate. Note that the smoothness of the loss does not change the rate, but may help to
obtain better constants, with the potential use of acceleration.

The compactness of the domain is often used within the algorithm (by using orthogonal projections)
and within the analysis (in particular to optimize the step size and obtain high-probability bounds).
In this paper, we do not make such compactness assumptions, since in a machine learning context,
the available bound would be loose and hurt practical performance.

Another difference between several analyses is the use of decaying step sizes of the form $\gamma_n \propto 1/\sqrt{n}$
vs. the use of a constant step size of the form $\gamma \propto 1/\sqrt{N}$ for a finite known horizon $N$ of iterations.

The use of a "doubling trick" as done by [17] for strongly convex optimization, where a constant step
size is used for iterations between $2^p$ and $2^{p+1}$, with a constant that is proportional to $1/\sqrt{2^p}$, would
allow us to obtain an anytime algorithm from a finite horizon one. In order to simplify our analysis,
we only consider a finite horizon $N$ and a constant step-size that will be proportional to $1/\sqrt{N}$.
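
The doubling trick mentioned above can be sketched as a piecewise-constant schedule. This is a minimal illustration only: the function name is hypothetical, and the constant $1/(2R^2)$ in front of $1/\sqrt{2^p}$ is an assumption borrowed from the finite-horizon step size used later in the paper, since the text only specifies proportionality.

```python
import math


def doubling_step_sizes(num_iterations, radius):
    """Step sizes gamma_n for n = 1..num_iterations.

    Iterations n with 2^p <= n < 2^(p+1) share the constant step size
    1 / (2 R^2 sqrt(2^p)); the factor 1/(2 R^2) is an assumed constant.
    """
    steps = []
    for n in range(1, num_iterations + 1):
        p = n.bit_length() - 1  # largest p such that 2^p <= n
        steps.append(1.0 / (2.0 * radius ** 2 * math.sqrt(2 ** p)))
    return steps


steps = doubling_step_sizes(8, 1.0)
```

Within each block the step size is constant, so a finite-horizon analysis can be applied block by block without knowing the total number of iterations in advance.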



Strongly-convex functions. When the function is $\mu$-strongly convex, i.e., $\theta \mapsto f(\theta) - \frac{\mu}{2}\|\theta\|^2$ is
convex, there are essentially two approaches to obtaining the minimax-optimal rate $O(1/(\mu n))$ [20, 21]:
(a) using a step size proportional to $1/(\mu n)$ with averaging for non-smooth problems [13, 14],
or a step size proportional to $1/(R^2 + n\mu)$, also with averaging, for smooth problems, where $R^2$ is the
smoothness constant of the loss of a single observation [24]; (b) for smooth problems, using longer
step-sizes proportional to $1/n^\alpha$ for $\alpha \in (1/2, 1)$ with averaging.

Note that the "historical" step size, i.e., of the form $C/n$ where $C$ is larger than $1/\mu$, leads, without
averaging, to a convergence rate of $O(1/(\mu^2 n))$, hence to a worse dependence on $\mu$.

The solution (a) requires a good estimate of the strong-convexity constant $\mu$, while the
second solution (b) does not require such an estimate and leads to a convergence rate achieving
asymptotically the Cramér-Rao lower bound. Thus, this last solution is adaptive to an unknown
(but positive) amount of strong convexity. However, unless we take the limiting setting $\alpha = 1/2$, it
is not adaptive to lack of strong convexity. While the non-asymptotic analysis of [16] already gives
a convergence rate in that situation, the bound is rather complicated and also has a suboptimal
dependence on $\mu$. One goal of this paper is to consider a less general result, but more compact (note
also that the analysis of [16] only applies for globally strongly convex functions, see below).

Finally, note that unless we restrict the support, the objective function for logistic regression cannot
be globally strongly convex (since the Hessian tends to zero when $\|\theta\|$ tends to infinity). Another
goal of the paper is to show that stochastic gradient descent with averaging is adaptive to the local
strong convexity constant, i.e., the lowest eigenvalue of the Hessian of $f$ at the global optimum,
without any exponential terms in $RD$ (which would be present if a compact domain of diameter $D$
was imposed and traditional analyses were performed).

Adaptivity to unknown constants. The desirable property of adaptivity to the difficulty of an
optimization problem has also been studied in several settings. Gradient descent with constant step
size is for example naturally adaptive to the strong convexity of the problem. In
the stochastic context, [26] provides another strategy than averaging with longer step sizes, but for
uniform convexity constants.



3 Non-strongly convex analysis 



In this section, we study the averaged stochastic gradient method in the non-strongly convex case,
i.e., without any (global or local) strong convexity assumptions. We first recall existing results in
Section 3.1 that bound the expectation of the excess risk, leading to a bound in $O(1/\sqrt{n})$. We
then show using martingale moment inequalities how all higher-order moments may be bounded in
Section 3.2, still with a rate of $O(1/\sqrt{n})$. However, in Section 3.3, we consider the convergence of
the squared gradient, with now a rate of $O(1/n)$. This last result is key to obtaining the adaptivity
to local strong convexity in Section 4.

3.1 Existing results

In this section, we review existing results for Lipschitz-continuous non-strongly convex problems
[13, 14, 15]. Note that smoothness is not needed here. We consider a constant step size $\gamma_n = \gamma > 0$,
for all $n \geq 1$. We denote by $\bar\theta_n = \frac{1}{n}\sum_{k=0}^{n-1}\theta_k$ the averaged iterate.

We prove the following proposition, which provides a bound on the expectation of $f(\bar\theta_n) - f(\theta_*)$
that decays at rate $O(\gamma + 1/(\gamma n))$, hence the usual choice $\gamma \propto 1/\sqrt{n}$:

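Proposition 1 With constant step size equal to $\gamma$, for any $n \geq 0$, we have:

$$\mathbb{E} f\Big(\frac{1}{n}\sum_{k=1}^n \theta_{k-1}\Big) - f(\theta_*) + \frac{1}{2\gamma n}\mathbb{E}\|\theta_n - \theta_*\|^2 \leq \frac{1}{2\gamma n}\|\theta_0 - \theta_*\|^2 + \frac{\gamma}{2}R^2. \tag{1}$$

Proof. We have the following recursion, obtained from the Lipschitz-continuity of $f_n$:

$$\|\theta_n - \theta_*\|^2 = \|\theta_{n-1} - \theta_*\|^2 - 2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1})\rangle + \gamma^2\|f_n'(\theta_{n-1})\|^2$$
$$\leq \|\theta_{n-1} - \theta_*\|^2 - 2\gamma\langle\theta_{n-1} - \theta_*, f'(\theta_{n-1})\rangle + \gamma^2R^2 + M_n,$$

with

$$M_n = -2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1}) - f'(\theta_{n-1})\rangle.$$

We thus get, using the classical result from convexity $f(\theta_{n-1}) - f(\theta_*) \leq \langle\theta_{n-1} - \theta_*, f'(\theta_{n-1})\rangle$:

$$2\gamma[f(\theta_{n-1}) - f(\theta_*)] \leq \|\theta_{n-1} - \theta_*\|^2 - \|\theta_n - \theta_*\|^2 + \gamma^2R^2 + M_n. \tag{2}$$

Summing over integers less than $n$, this implies:

$$\frac{1}{n}\sum_{k=1}^n f(\theta_{k-1}) - f(\theta_*) + \frac{1}{2\gamma n}\|\theta_n - \theta_*\|^2 \leq \frac{1}{2\gamma n}\|\theta_0 - \theta_*\|^2 + \frac{\gamma}{2}R^2 + \frac{1}{2\gamma n}\sum_{k=1}^n M_k.$$

We get the desired result by taking expectation in the last inequality, and using
$\mathbb{E}M_k = \mathbb{E}(\mathbb{E}(M_k \mid \mathcal{F}_{k-1})) = 0$ and $f\big(\frac{1}{n}\sum_{k=1}^n\theta_{k-1}\big) \leq \frac{1}{n}\sum_{k=1}^n f(\theta_{k-1})$. ∎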

The following corollary considers a specific choice of the step size (note that the bound is only true
for the last iteration):

Corollary 1 With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, we have:

$$\forall n \in \{1, \dots, N\},\quad \mathbb{E}\|\theta_n - \theta_*\|^2 \leq \|\theta_0 - \theta_*\|^2 + \frac{1}{4R^2}, \tag{3}$$
$$\mathbb{E} f\Big(\frac{1}{N}\sum_{k=1}^N\theta_{k-1}\Big) - f(\theta_*) \leq \frac{R^2}{\sqrt{N}}\|\theta_0 - \theta_*\|^2 + \frac{1}{4\sqrt{N}}. \tag{4}$$

Note that if $\|\theta_0 - \theta_*\|^2$ was known, then a better step-size would be $\gamma = \frac{\|\theta_0 - \theta_*\|}{R\sqrt{N}}$, leading to a
convergence rate proportional to $\frac{R\|\theta_0 - \theta_*\|}{\sqrt{N}}$. However, this requires an estimate (simply an
upper-bound) of $\|\theta_0 - \theta_*\|^2$, which is typically not available.

We are going to improve this result in several ways:

- All moments of $\|\theta_n - \theta_*\|^2$ and $f(\bar\theta_n) - f(\theta_*)$ will be bounded, leading to a sub-Gaussian
behavior. Note that we do not assume that the iterates are restricted to a predefined bounded
set (which is the usual assumption made to derive tail bounds [27]).

- We are going to show that the squared norm of the gradient at $\bar\theta_n = \frac{1}{n}\sum_{k=1}^n\theta_{k-1}$ converges at
rate $O(n^{-1})$, even in the non-strongly convex case. This will allow us to derive finer convergence
rates in presence of local strong convexity in Section 4.
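
The finite-horizon procedure analyzed in this section can be sketched as follows for logistic regression: a single pass of the recursion (A7) with constant step size $\gamma = 1/(2R^2\sqrt{N})$, together with the running average of $\theta_0, \dots, \theta_{N-1}$. The synthetic data generator, the choice $\theta_0 = 0$, and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def sigma_neg(margin):
    """Return 1 / (1 + exp(margin)), computed without overflow."""
    if margin > 0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


def averaged_sgd_logistic(samples, radius):
    """One pass of constant-step SGD; returns (theta_bar, theta_N, gamma)."""
    n_obs = len(samples)
    dim = len(samples[0][0])
    gamma = 1.0 / (2.0 * radius ** 2 * math.sqrt(n_obs))
    theta = [0.0] * dim      # theta_0 = 0 (assumed starting point)
    theta_bar = [0.0] * dim  # average of theta_0, ..., theta_{N-1}
    for x, y in samples:
        for j in range(dim):
            theta_bar[j] += theta[j] / n_obs
        margin = y * sum(a * b for a, b in zip(x, theta))
        # The gradient of log(1 + exp(-y <x, theta>)) is -y * sigma_neg(margin) * x.
        scale = gamma * y * sigma_neg(margin)
        for j in range(dim):
            theta[j] += scale * x[j]
    return theta_bar, theta, gamma


def make_data(n_obs, dim, radius, theta_true, seed=0):
    """Synthetic data with ||x|| = radius and labels from a logistic model."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_obs):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        x = [radius * v / norm for v in g]
        p_plus = 1.0 - sigma_neg(sum(a * b for a, b in zip(x, theta_true)))
        y = 1 if rng.random() < p_plus else -1
        samples.append((x, y))
    return samples


def mean_loss(samples, theta):
    """Empirical logistic loss of theta on the samples."""
    total = 0.0
    for x, y in samples:
        margin = y * sum(a * b for a, b in zip(x, theta))
        total += math.log1p(math.exp(-margin))
    return total / len(samples)


samples = make_data(2000, 3, 1.0, [2.0, -1.0, 0.5])
theta_bar, theta_last, gamma = averaged_sgd_logistic(samples, 1.0)
```

Since every stochastic gradient has norm at most $R$, the last iterate satisfies $\|\theta_N - \theta_0\| \leq N\gamma R$ deterministically, which is the almost-sure control used later in the proofs.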



3.2 Higher-order bound 

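In this section, we prove higher-order bounds (see the proof in Appendix C, which is based on taking
powers of the inequality in Eq. (2) and using martingale moment inequalities), both for any constant
step-size and then for the specific choice $\gamma = \frac{1}{2R^2\sqrt{N}}$.

Proposition 2 With constant step size equal to $\gamma$, for any $n \geq 0$ and integer $p \in \{1, \dots, \lfloor n/4\rfloor\}$,
we have:

$$\mathbb{E}\Big(2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2\Big)^p \leq \big(3\|\theta_0 - \theta_*\|^2 + 20np\gamma^2R^2\big)^p. \tag{5}$$

Corollary 2 With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any integer $p \geq 1$, we have:

$$\mathbb{E}\|\theta_N - \theta_*\|^{2p} \leq \big(3\|\theta_0 - \theta_*\|^2 + 5pR^{-2}\big)^p, \tag{6}$$
$$\mathbb{E}[f(\bar\theta_N) - f(\theta_*)]^p \leq \Big(\frac{3R^2}{\sqrt{N}}\|\theta_0 - \theta_*\|^2 + \frac{5p}{\sqrt{N}}\Big)^p. \tag{7}$$

Having a bound on all moments immediately allows us to derive large deviation bounds in the same two
cases (by applying Lemma 1 from Appendix A):

Proposition 3 With constant step size equal to $\gamma$, for any $n \geq 0$ and $t \geq 0$, we have:

$$\mathbb{P}\Big(f(\bar\theta_n) - f(\theta_*) \geq 30\gamma R^2t + \frac{3\|\theta_0 - \theta_*\|^2}{\gamma n}\Big) \leq 2\exp(-t),$$
$$\mathbb{P}\Big(\|\theta_n - \theta_*\|^2 \geq 60n\gamma^2R^2t + 6\|\theta_0 - \theta_*\|^2\Big) \leq 2\exp(-t).$$

Corollary 3 With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any $t \geq 0$, we have:

$$\mathbb{P}\Big(f(\bar\theta_N) - f(\theta_*) \geq \frac{15t}{\sqrt{N}} + \frac{6R^2}{\sqrt{N}}\|\theta_0 - \theta_*\|^2\Big) \leq 2\exp(-t),$$
$$\mathbb{P}\Big(\|\theta_N - \theta_*\|^2 \geq 15R^{-2}t + 6\|\theta_0 - \theta_*\|^2\Big) \leq 2\exp(-t).$$

We can make the following observations:

- The iterates $\theta_n$ and $\bar\theta_n$ do not necessarily converge to $\theta_*$ (note that $\theta_*$ may not be unique in
general anyway).

- Given that $(\mathbb{E}[f(\bar\theta_n) - f(\theta_*)]^p)^{1/p}$ is affine in $p$, we obtain a sub-exponential behavior, i.e., tail
bounds similar to an exponential distribution.

- The proof of Prop. 2 is rather technical and makes heavy use of martingale moment inequalities.

- The constants in the bounds of Prop. 2 (and thus other results as well) could clearly be
improved. In particular, we have, for $p = 1, 2, 3, 4$ (see proof in Appendix E):

$$\mathbb{E}\Big(2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2\Big) \leq \|\theta_0 - \theta_*\|^2 + n\gamma^2R^2,$$
$$\mathbb{E}\Big(2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2\Big)^2 \leq \big(\|\theta_0 - \theta_*\|^2 + 3n\gamma^2R^2\big)^2,$$
$$\mathbb{E}\Big(2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2\Big)^3 \leq \big(\|\theta_0 - \theta_*\|^2 + 6n\gamma^2R^2\big)^3,$$
$$\mathbb{E}\Big(2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2\Big)^4 \leq \big(\|\theta_0 - \theta_*\|^2 + 9n\gamma^2R^2\big)^4.$$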

3.3 Convergence of gradients 

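In this section, we prove higher-order bounds on the convergence of the gradient, with rate $O(n^{-1})$
for $\|f'(\bar\theta_n)\|^2$:

Proposition 4 With constant step size equal to $\gamma$, for any $n \geq 0$ and integer $p \in \{1, \dots, \lfloor n/4\rfloor\}$,
we have:

$$\Big(\mathbb{E}\Big\|f'\Big(\frac{1}{n}\sum_{k=1}^n\theta_{k-1}\Big)\Big\|^{2p}\Big)^{1/2p} \leq \frac{R}{\sqrt{n}}\Big[10\sqrt{p} + 40R^2\gamma p\sqrt{n} + \frac{3}{\gamma\sqrt{n}}\|\theta_0 - \theta_*\|^2 + \frac{2}{\gamma R\sqrt{n}}\|\theta_0 - \theta_*\|\Big]. \tag{8}$$

Corollary 4 With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any integer $p \in \{1, \dots, \lfloor N/4\rfloor\}$, we
have:

$$\Big(\mathbb{E}\Big\|f'\Big(\frac{1}{N}\sum_{k=1}^N\theta_{k-1}\Big)\Big\|^{2p}\Big)^{1/2p} \leq \frac{R}{\sqrt{N}}\Big[10\sqrt{p} + 20p + 6R^2\|\theta_0 - \theta_*\|^2 + 4R\|\theta_0 - \theta_*\|\Big]. \tag{9}$$

We can make the following observations:

- The squared norm of the gradient $\|f'(\bar\theta_n)\|^2$ converges at rate $O(n^{-1})$.

- Given that $(\mathbb{E}\|f'(\bar\theta_n)\|^{2p})^{1/2p}$ is affine in $p$, we obtain a sub-exponential behavior for $\|f'(\bar\theta_n)\|$,
i.e., tail bounds similar to an exponential distribution.

- The proof of Prop. 4 makes use of the self-concordance assumption (which allows us to upper-bound
deviations of gradients by deviations of function values) and of the proof technique of [16].

- The various terms may be improved for small $p$. In particular, we have, for $p = 1, 2$:

$$\Big(\mathbb{E}\Big\|f'\Big(\frac{1}{n}\sum_{k=1}^n\theta_{k-1}\Big)\Big\|^{2}\Big)^{1/2} \leq \frac{R}{\sqrt{n}}\Big[3 + \gamma\sqrt{n}R^2 + \frac{1}{\gamma\sqrt{n}R^2}\big(2R\|\theta_0 - \theta_*\| + R^2\|\theta_0 - \theta_*\|^2\big)\Big],$$
$$\Big(\mathbb{E}\Big\|f'\Big(\frac{1}{n}\sum_{k=1}^n\theta_{k-1}\Big)\Big\|^{4}\Big)^{1/4} \leq \frac{R}{\sqrt{n}}\Big[5 + 6\gamma\sqrt{n}R^2 + \frac{1}{\gamma\sqrt{n}R^2}\big(\dots\big)\Big],$$

where the omitted terms are proportional to $R\|\theta_0 - \theta_*\|$ and $R^2\|\theta_0 - \theta_*\|^2$.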

4 Self-concordance analysis 



In the previous section, we have shown that $\|f'(\bar\theta_n)\|^2$ is of order $O(n^{-1})$. If the function $f$ was
strongly convex with constant $\mu > 0$, this would immediately lead to the bound $f(\bar\theta_n) - f(\theta_*) \leq
\frac{1}{2\mu}\|f'(\bar\theta_n)\|^2$, of order $O(\mu^{-1}n^{-1})$. However, because of the Lipschitz-continuity of $f$ on the full
Hilbert space $\mathcal{H}$, it cannot be strongly convex. In this section, we show how the self-concordance
assumption may be used to obtain the exact same behavior, but with $\mu$ replaced by the local strong
convexity constant.
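
The strongly convex bound invoked in this paragraph follows from a short standard argument, sketched here for completeness under the assumption that $f$ is differentiable and $\mu$-strongly convex; this derivation is not spelled out in the text.

```latex
% Strong convexity at \theta, evaluated at the minimizer \theta_*:
f(\theta_*) \;\geq\; f(\theta) + \langle f'(\theta), \theta_* - \theta \rangle
              + \frac{\mu}{2}\|\theta_* - \theta\|^2 .
% Minimizing the right-hand side over the displacement \delta = \theta_* - \theta
% (attained at \delta = -f'(\theta)/\mu) gives
f(\theta_*) \;\geq\; f(\theta) - \frac{1}{2\mu}\|f'(\theta)\|^2 ,
\qquad\text{i.e.}\qquad
f(\theta) - f(\theta_*) \;\leq\; \frac{1}{2\mu}\|f'(\theta)\|^2 .
```

Applied at $\theta = \bar\theta_n$, a gradient bound of order $O(n^{-1})$ in squared norm thus transfers directly to the excess function value.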

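The required property is summarized in the following proposition about (generalized) self-concordant
functions (see proof in Appendix B.1):

Proposition 5 Let $f$ be a convex three-times differentiable function from $\mathcal{H}$ to $\mathbb{R}$, such that for all
$\theta_1, \theta_2 \in \mathcal{H}$, the function $\varphi : t \mapsto f[\theta_1 + t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in \mathbb{R}$, $|\varphi'''(t)| \leq R\|\theta_1 - \theta_2\|\varphi''(t)$.
Let $\theta_*$ be a global minimizer of $f$ and $\mu$ the lowest eigenvalue of $f''(\theta_*)$, which is assumed strictly
positive.

If $\frac{\|f'(\theta)\|R}{\mu} \leq \frac{3}{4}$, then

$$\|\theta - \theta_*\| \leq \frac{2\|f'(\theta)\|}{\mu} \quad\text{and}\quad f(\theta) - f(\theta_*) \leq \frac{2\|f'(\theta)\|^2}{\mu}.$$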



We may now use this proposition for the averaged stochastic gradient. For simplicity, we only
consider the step-size $\gamma = \frac{1}{2R^2\sqrt{N}}$, and the last iterate:

Proposition 6 Assume $\gamma = \frac{1}{2R^2\sqrt{N}}$. Let $\mu > 0$ be the lowest eigenvalue of the Hessian of $f$ at the
unique global optimum $\theta_*$. Then:

$$\mathbb{E}f(\bar\theta_N) - f(\theta_*) \leq \frac{R^2}{N\mu}\big(5R\|\theta_0 - \theta_*\| + 15\big)^4,$$
$$\mathbb{E}\|\bar\theta_N - \theta_*\|^2 \leq \frac{R^2}{N\mu^2}\big(6R\|\theta_0 - \theta_*\| + 20\big)^4.$$

We can make the following observations:

- The proof relies on Prop. 5 and requires a control of the probability that $\frac{\|f'(\bar\theta_N)\|R}{\mu} \geq \frac{3}{4}$, which
is obtained from Prop. 4.

- We conjecture a bound of the form $\big(\frac{R^2}{N\mu}(\square R\|\theta_0 - \theta_*\| + \triangle\sqrt{p})^4\big)^p$ for the $p$-th order moment
of $f(\bar\theta_N) - f(\theta_*)$.

- The key elements in the previous proposition are that (a) the constant $\mu$ is the local convexity
constant, and (b) the step-size does not depend on that constant $\mu$, hence the claimed
adaptivity.

- The bounds are only better than the non-strongly-convex bounds from Prop. 1 when the
Hessian lowest eigenvalue is large enough, i.e., $\mu\sqrt{N}/R^2$ larger than a fixed constant.



5 Conclusion 



In this paper, we have provided a novel analysis of averaged stochastic gradient for logistic regression
and related problems. The key aspects of our result are (a) the adaptivity to local strong convexity
provided by averaging and (b) the use of self-concordance to obtain a simple bound that does not
involve a term which is exponential in $R\|\theta_0 - \theta_*\|$, which could be obtained by constraining the
domain of the iterates.

Our results could be extended in several ways: (a) with a finite and known horizon $N$, we considered
a constant step-size proportional to $1/(R^2\sqrt{N})$; it thus seems natural to study the decaying step
size $\gamma_n = O(1/(R^2\sqrt{n}))$, which should, up to logarithmic terms, lead to similar results (and thus
likely provide a solution to a recently posed open problem for online logistic regression); (b)
an alternative would be to consider a doubling trick where the step-sizes are piecewise constant;
finally, (c) it may be possible to consider other assumptions, such as exp-concavity [17] or uniform
convexity [26], to derive similar or improved results.

A Probability lemmas

In this appendix, we prove lemmas relating bounds on moments to tail bounds, with the traditional
use of Markov's inequality.

Lemma 1 Let $X$ be a non-negative random variable such that for some positive constants $A$ and
$B$, and all $p \in \{1, \dots, n\}$,

$$\mathbb{E}X^p \leq (A + Bp)^p.$$

Then, if $t \leq \frac{n}{2}$,

$$\mathbb{P}(X \geq 3Bt + 2A) \leq 2\exp(-t).$$

Proof. We have, by Markov's inequality, for any $p \in \{1, \dots, n\}$:

$$\mathbb{P}(X \geq 2Bp + 2A) \leq \frac{\mathbb{E}X^p}{(2Bp + 2A)^p} \leq \frac{(A + Bp)^p}{(2A + 2Bp)^p} = \exp(-\log(2)p).$$

For $u \in [1, n]$, we consider $p = \lfloor u\rfloor$, so that

$$\mathbb{P}(X \geq 2Bu + 2A) \leq \mathbb{P}(X \geq 2Bp + 2A) \leq \exp(-\log(2)p) \leq 2\exp(-\log(2)u).$$

We take $t = \log(2)u$ and use $2/\log 2 \leq 3$. This is thus valid if $t \leq \frac{n}{2}$. ∎
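
Lemma 1 can be illustrated on a case where both the moments and the tail are known in closed form. The sketch below assumes $X$ is a unit-rate exponential variable, so that $\mathbb{E}X^p = p!$ and $\mathbb{P}(X \geq s) = e^{-s}$; the constants $A = B = 1$, the range $t \in [0, n/2]$, and the function names are illustrative choices, not taken from the paper.

```python
import math


def moment_condition_holds(a_const, b_const, n):
    """Check E X^p = p! <= (A + B p)^p for X ~ Exp(1) and p = 1..n."""
    return all(
        math.factorial(p) <= (a_const + b_const * p) ** p
        for p in range(1, n + 1)
    )


def tail_bound_holds(a_const, b_const, n, num_points=200):
    """Check P(X >= 3 B t + 2 A) <= 2 exp(-t) on a grid of t in [0, n/2]."""
    for i in range(num_points + 1):
        t = (n / 2.0) * i / num_points
        exact_tail = math.exp(-(3.0 * b_const * t + 2.0 * a_const))
        if exact_tail > 2.0 * math.exp(-t):
            return False
    return True


moments_ok = moment_condition_holds(1.0, 1.0, 20)
tails_ok = tail_bound_holds(1.0, 1.0, 20)
```

For this distribution the tail bound is far from tight, which is expected: the lemma only uses the moment growth $(A + Bp)^p$ and therefore holds for any variable with that growth.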



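Lemma 2 Let $X$ be a non-negative random variable such that for some positive constants $A$, $B$ and
$C$, and for all $p \in \{1, \dots, n\}$,

$$\mathbb{E}X^p \leq (A\sqrt{p} + Bp + C)^{2p}.$$

Then, if $t \leq n$,

$$\mathbb{P}\big(X \geq (2A\sqrt{t} + 2Bt + 2C)^2\big) \leq 4\exp(-t).$$

Proof. We have, by Markov's inequality, for any $p \in \{1, \dots, n\}$:

$$\mathbb{P}\big(X \geq (2A\sqrt{p} + 2Bp + 2C)^2\big) \leq \frac{\mathbb{E}X^p}{(2A\sqrt{p} + 2Bp + 2C)^{2p}} \leq \frac{(A\sqrt{p} + Bp + C)^{2p}}{(2A\sqrt{p} + 2Bp + 2C)^{2p}} = \exp(-\log(4)p).$$

For $u \in [1, n]$, we consider $p = \lfloor u\rfloor$, so that

$$\mathbb{P}\big(X \geq (2A\sqrt{u} + 2Bu + 2C)^2\big) \leq \mathbb{P}\big(X \geq (2A\sqrt{p} + 2Bp + 2C)^2\big) \leq \exp(-\log(4)p) \leq 4\exp(-\log(4)u).$$

We take $t = \log(4)u$ and use $\log 4 \geq 1$. This is thus valid if $t \leq n$. ∎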



B Self-concordance properties 

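In this appendix, we show two lemmas regarding our generalized notion of self-concordance, as well
as Prop. 5. For more details, see [18] and references therein.

Lemma 3 Let $\varphi : [0, 1] \to \mathbb{R}$ be a convex three-times differentiable function such that for some
$S > 0$, $\forall t \in [0, 1]$, $|\varphi'''(t)| \leq S\varphi''(t)$. Assume $\varphi'(0) = 0$, $\varphi''(0) > 0$. Then:

$$\frac{\varphi'(1)}{\varphi''(0)}S \geq 1 - e^{-S} \quad\text{and}\quad \varphi(1) \leq \varphi(0) + \frac{\varphi'(1)^2}{\varphi''(0)}(1 + S).$$

Moreover, if $\alpha = \frac{\varphi'(1)S}{\varphi''(0)} < 1$, then $\varphi(1) \leq \varphi(0) + \frac{\varphi'(1)^2}{\varphi''(0)}\frac{1}{\alpha}\log\frac{1}{1 - \alpha}$. If in addition $\alpha \leq \frac{3}{4}$, then

$$\varphi(1) \leq \varphi(0) + 2\frac{\varphi'(1)^2}{\varphi''(0)} \quad\text{and}\quad \varphi''(0) \leq 2\varphi'(1).$$

Proof. By self-concordance, we obtain that the derivative of $u \mapsto \log\varphi''(u)$ is lower-bounded
by $-S$. By integrating between $0$ and $t \in [0, 1]$, we get

$$\log\varphi''(t) - \log\varphi''(0) \geq -St, \quad\text{i.e.,}\quad \varphi''(t) \geq \varphi''(0)e^{-St},$$

and by integrating between $0$ and $1$, we obtain (note that $\varphi'(0) = 0$):

$$\varphi'(1) \geq \varphi''(0)\frac{1 - e^{-S}}{S}. \tag{10}$$

We then get (with a first inequality from convexity of $\varphi$, and the last inequality from $e^S \geq 1 + S$):

$$\varphi(1) - \varphi(0) \leq \varphi'(1) \leq \frac{\varphi'(1)^2}{\varphi''(0)}\frac{S}{1 - e^{-S}} \leq \frac{\varphi'(1)^2}{\varphi''(0)}(1 + S).$$

Eq. (10) implies that $\alpha \geq 1 - e^{-S}$, which implies, if $\alpha < 1$, $S \leq \log\frac{1}{1 - \alpha}$. This implies that

$$\varphi(1) - \varphi(0) \leq \varphi'(1) \leq \frac{\varphi'(1)^2}{\varphi''(0)}\frac{S}{1 - e^{-S}} \leq \frac{\varphi'(1)^2}{\varphi''(0)}\frac{1}{\alpha}\log\frac{1}{1 - \alpha},$$

using the monotonicity of $S \mapsto \frac{S}{1 - e^{-S}}$. Finally, the last bounds are a consequence of $\frac{1}{\alpha}\log\frac{1}{1 - \alpha} \leq
2$, which is valid for $\alpha \leq \frac{3}{4}$. ∎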
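
The two scalar inequalities used in this proof, $\frac{S}{1 - e^{-S}} \leq 1 + S$ for $S > 0$ and $\frac{1}{\alpha}\log\frac{1}{1-\alpha} \leq 2$ for $\alpha \in (0, 3/4]$, can be checked numerically. The sketch below is only a sanity check on hypothetical grids, not a proof; the function names are illustrative.

```python
import math


def s_ratio(s):
    """S / (1 - exp(-S)), using expm1 for accuracy near S = 0."""
    return s / (-math.expm1(-s))


def alpha_ratio(alpha):
    """(1 / alpha) * log(1 / (1 - alpha)), using log1p for accuracy."""
    return -math.log1p(-alpha) / alpha


# Hypothetical grids: S in (0, 50] and alpha in (0, 3/4].
s_grid = [k / 100.0 for k in range(1, 5001)]
alpha_grid = [0.75 * k / 1000.0 for k in range(1, 1001)]

first_ok = all(s_ratio(s) <= 1.0 + s for s in s_grid)
largest_alpha_ratio = max(alpha_ratio(a) for a in alpha_grid)
```

The second quantity is increasing in $\alpha$, so its maximum on the grid is attained at $\alpha = 3/4$, where it equals $\frac{4}{3}\log 4$.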



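Lemma 4 Let $f$ be a convex three-times differentiable function from $\mathcal{H}$ to $\mathbb{R}$, such that for all
$\theta_1, \theta_2 \in \mathcal{H}$, the function $\varphi : t \mapsto f[\theta_1 + t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in \mathbb{R}$, $|\varphi'''(t)| \leq R\|\theta_1 - \theta_2\|\varphi''(t)$. For
any $\theta_1, \theta_2 \in \mathcal{H}$, we have:

$$\|f'(\theta_1) - f'(\theta_2) - f''(\theta_2)(\theta_1 - \theta_2)\| \leq R\big[f(\theta_1) - f(\theta_2) - \langle f'(\theta_2), \theta_1 - \theta_2\rangle\big].$$

Proof. For a given $z \in \mathcal{H}$ of unit norm, let $\varphi(t) = \langle z, f'(\theta_2 + t(\theta_1 - \theta_2)) - f'(\theta_2) - tf''(\theta_2)(\theta_1 - \theta_2)\rangle$
and $\psi(t) = R[f(\theta_2 + t(\theta_1 - \theta_2)) - f(\theta_2) - t\langle f'(\theta_2), \theta_1 - \theta_2\rangle]$. We have $\varphi(0) = \psi(0) = 0$ and
$\varphi'(0) = \psi'(0) = 0$. Moreover, we have $\varphi''(t) \leq \psi''(t)$ (using the same reasoning as in the proofs
of [18]). We thus have $\varphi(1) \leq \psi(1)$, which leads to the desired result by maximizing with respect
to $z$. ∎

B.1 Proof of Prop. 5

Define $\varphi : t \mapsto f[\theta_* + t(\theta - \theta_*)] - f(\theta_*)$. We have: $\varphi(0) = \varphi'(0) = 0$, $0 \leq \varphi'(1) = \langle f'(\theta), \theta - \theta_*\rangle \leq
\|f'(\theta)\|\|\theta - \theta_*\|$, $\varphi''(0) = \langle\theta - \theta_*, f''(\theta_*)(\theta - \theta_*)\rangle \geq \mu\|\theta - \theta_*\|^2$, and $\varphi(t) \geq 0$ for all $t \in [0, 1]$, and
$\varphi'''(t) \leq R\|\theta - \theta_*\|\varphi''(t)$ for all $t \in [0, 1]$, i.e., $S = R\|\theta - \theta_*\|$. Lemma 3 leads to the desired result,
with $\alpha = \frac{\varphi'(1)S}{\varphi''(0)} \leq \frac{\|f'(\theta)\|R}{\mu}$. Note that we also have, for all $\theta \in \mathcal{H}$,

$$f(\theta) - f(\theta_*) \leq (1 + R\|\theta - \theta_*\|)\frac{\|f'(\theta)\|^2}{\mu} \quad\text{and}\quad \|\theta - \theta_*\| \leq (1 + R\|\theta - \theta_*\|)\frac{\|f'(\theta)\|}{\mu}.$$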
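
C Proof of Prop. 2

We consider a direct proof based on taking powers of the inequality in Eq. (2), and then using the
appropriate martingale properties.

C.1 Derivation of recursion

We have the recursion:

$$2\gamma[f(\theta_{n-1}) - f(\theta_*)] + \|\theta_n - \theta_*\|^2 \leq \|\theta_{n-1} - \theta_*\|^2 + \gamma^2R^2 + M_n,$$

with

$$M_n = -2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1}) - f'(\theta_{n-1})\rangle.$$

This leads to

$$2\gamma n f\Big(\frac{1}{n}\sum_{k=1}^n\theta_{k-1}\Big) - 2\gamma n f(\theta_*) + \|\theta_n - \theta_*\|^2 \leq A_n,$$

with $A_n = \|\theta_0 - \theta_*\|^2 + n\gamma^2R^2 + \sum_{k=1}^n M_k$. Note that $\mathbb{E}(M_k \mid \mathcal{F}_{k-1}) = 0$ and $|M_k| \leq 4\gamma R\|\theta_{k-1} - \theta_*\| \leq
4\gamma RA_{k-1}^{1/2}$ almost surely. This leads to, by using the binomial expansion formula:

$$A_n^p = (A_{n-1} + \gamma^2R^2 + M_n)^p = \sum_{k=0}^p\binom{p}{k}(A_{n-1} + \gamma^2R^2)^{p-k}M_n^k$$
$$\leq (A_{n-1} + \gamma^2R^2)^p + p(A_{n-1} + \gamma^2R^2)^{p-1}M_n + \sum_{k=2}^p\binom{p}{k}(A_{n-1} + \gamma^2R^2)^{p-k}(4\gamma RA_{n-1}^{1/2})^k.$$

This leads to (using $\mathbb{E}(M_n \mid \mathcal{F}_{n-1}) = 0$ and upper bounding $\gamma^2R^2$ by $4\gamma^2R^2$):

$$\mathbb{E}[A_n^p \mid \mathcal{F}_{n-1}] \leq (A_{n-1} + 4\gamma^2R^2)^p + \sum_{k=2}^p\binom{p}{k}(A_{n-1} + 4\gamma^2R^2)^{p-k}(4\gamma RA_{n-1}^{1/2})^k$$
$$= (A_{n-1} + 4\gamma^2R^2 + 4\gamma RA_{n-1}^{1/2})^p - 4\gamma Rp(A_{n-1} + 4\gamma^2R^2)^{p-1}A_{n-1}^{1/2}$$
$$= (A_{n-1}^{1/2} + 2\gamma R)^{2p} - 4\gamma Rp(A_{n-1} + 4\gamma^2R^2)^{p-1}A_{n-1}^{1/2}$$
$$= \sum_{k=0}^{2p}\binom{2p}{k}A_{n-1}^{k/2}(2\gamma R)^{2p-k} - 4\gamma RpA_{n-1}^{1/2}\sum_{k=0}^{p-1}\binom{p-1}{k}A_{n-1}^k(2\gamma R)^{2(p-1-k)}$$
$$= \sum_{k=0}^{2p}C_kA_{n-1}^{k/2}(2\gamma R)^{2p-k},$$

with

$$C_{2q} = \binom{2p}{2q} \text{ for } q \in \{0, \dots, p\}, \qquad C_{2q+1} = \binom{2p}{2q+1} - 2p\binom{p-1}{q} \text{ for } q \in \{0, \dots, p-1\}.$$

In particular, $C_0 = 1$, $C_{2p} = 1$, $C_1 = 0$ and $C_{2p-1} = \binom{2p}{2p-1} - 2p\binom{p-1}{p-1} = 0$.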

We have, for $q \in \{1, \dots, p-2\}$,

$$C_{2q+1}\frac{2q+1}{2p-2q-1} \leq \binom{2p}{2q+1}\frac{2q+1}{2p-2q-1} = \frac{(2p)!}{(2q+1)!\,(2p-2q-1)!}\,\frac{2q+1}{2p-2q-1}$$
$$= \frac{(2p)!}{(2q)!\,(2p-2q)!}\,\frac{2p-2q}{2p-2q-1} = \binom{2p}{2q}\frac{2p-2q}{2p-2q-1}.$$

For $q = p-2$, we obtain $C_{2q+1}\frac{2q+1}{2p-2q-1} \leq \frac{4}{3}C_{2q}$, while for $q \leq p-3$, we obtain $C_{2q+1}\frac{2q+1}{2p-2q-1} \leq \frac{6}{5}C_{2q}$.
Moreover, for $q \in \{1, \dots, p-2\}$,

$$C_{2q+1}\frac{2p-2q-1}{2q+1} \leq \binom{2p}{2q+1}\frac{2p-2q-1}{2q+1} = \frac{(2p)!}{(2q+1)!\,(2p-2q-1)!}\,\frac{2p-2q-1}{2q+1}$$
$$= \frac{(2p)!}{(2q+2)!\,(2p-2q-2)!}\,\frac{2q+2}{2q+1} = \binom{2p}{2q+2}\frac{2q+2}{2q+1}.$$

For $q = 1$, we obtain $C_{2q+1}\frac{2p-2q-1}{2q+1} \leq \frac{4}{3}C_{2q+2}$, while for $q \geq 2$, we obtain $C_{2q+1}\frac{2p-2q-1}{2q+1} \leq \frac{6}{5}C_{2q+2}$.

We have moreover 

c 2q+1 Al + _T{iiR) 2p ~ 2q - 1 

= C 2g+1 ^_ 1 (2 7 i?) 2 ^ 2 ^ 2 4( 2 1 (2 7j R) 

1 4 / _ 2 1 (2 7j R) 



< C 2g+1 ^_ 1 (2 7j R) 2 P- 2 ^ 2 i 



29+1 -(2 7 i i ) 2 + ^flA„_ 1 



2p - 2g - r ' 2g + 1 



- ^J-I^a^h^ + ^ 2?+I ^i±L_^ l(27fl) ^-, 

By combining all elements, we get that the terms indexed by 2g + 1 are bounded by the terms 
indexed by 2g + 2 and 2q. All terms with q G {2, . . . ,p — 3} are expanded with constants |, while 
for q — 1 and q = p — 2. this is |. Overall each even term receives a contribution which is less than 
max{§, § + | , |} = if. This leads to 

EC 2g+1 ^+ 1 / 2 (2 7 i?) 2 ^ 2 ^ 1 < ggc 2g A r V 1 (2 7 i?) 2 ^ 2 «, 

q=l q=0 

leading to the recursion that will allow us to derive our result: 

0-1 



q=0 ^ ^' 



(11) 



C.2 First bound 

In this section, we derive an almost sure bound that will be valid for small $n$. Since $\|\theta_n - \theta_*\| \leq
\|\theta_{n-1} - \theta_*\| + \gamma R$ almost surely, we have $\|\theta_n - \theta_*\| \leq \|\theta_0 - \theta_*\| + n\gamma R$ for all $n \geq 0$. This in turn
implies that

$$A_n \leq \|\theta_0 - \theta_*\|^2 + n\gamma^2R^2 + 4\gamma R\sum_{k=1}^n\|\theta_{k-1} - \theta_*\|$$
$$\leq \|\theta_0 - \theta_*\|^2 + n\gamma^2R^2 + 4\gamma R\sum_{k=1}^n\big[\|\theta_0 - \theta_*\| + (k-1)\gamma R\big]$$
$$\leq \|\theta_0 - \theta_*\|^2 + n\gamma^2R^2 + 4\gamma nR\|\theta_0 - \theta_*\| + 2\gamma^2R^2n^2$$
$$\leq \|\theta_0 - \theta_*\|^2 + n\gamma^2R^2 + 2\gamma^2n^2R^2 + 2\|\theta_0 - \theta_*\|^2 + 2\gamma^2R^2n^2$$
$$\leq 3\|\theta_0 - \theta_*\|^2 + 5n^2\gamma^2R^2 \text{ almost surely.} \tag{12}$$



C.3 Proof by induction 

We now proceed by induction on $p$. We assume that $\mathbb{E}A_k^q \leq (3\|\theta_0 - \theta_*\|^2 + kq\gamma^2R^2B)^q$ for $q < p$
and a certain $B$ (which we will take to be equal to 20). We first note that if $n \leq 4p$, then from
Eq. (12), we have

$$\mathbb{E}A_n^p \leq (3\|\theta_0 - \theta_*\|^2 + 5n^2\gamma^2R^2)^p \leq (3\|\theta_0 - \theta_*\|^2 + 20np\gamma^2R^2)^p.$$

Thus, we only need to consider $n \geq 4p$. We then get from Eq. (11):

10 fe=o g=0 

< ||0o - 0*|| 2p + S EE ff ) ( 3|l ^° - * !|2 + ^7 2 i? 2 A) 9 (2 7 i?) 2p - 2? 

fc=0 g=0 ^ q ' 

using the induction hypothesis, 
= ||0o - 04 2p + ^ E (f) (27i?) 2p - 29 E (3||<9o - 0*|| 2 + fc 9 7 2 i? 2 A) 9 

* 11*0 - °*\\ 2p + |E g)( 2 7i?) 2p - 29 E 33 n^ - M 2j ( J) Gr^r'^^ 



9=0 ^ " j=0 

usine 



sing fc Q < for any a > 0, 

a + 1 

fc=o 



oo - m 2p + §x>k - ^ w^v-'E g) (J) n " P+1 



j'=o 

We want to show that it is less than 



V Vi/ 4 9 - J + 1 



(3||0 O - 0*|| 2 + k P1 2 R 2 A) p = 3 p ||0 o - 0*|| 2p + E 33 H o - e4 2 ^ 1 2 R 2 nY^{Apf^ ( P . 

3=0 ^ J 

By comparing all terms in ||0o — 0*|| 2j , this is true as soon as for all j £ {0, . . . ,p — 1}, 
This is implied by (if $n \geq 4p$):



136 
15~ 



fc=0 

p-i-3 



fe=0 

We have 



*W 2 P \(p-i-fc)...(p-fc-j- + i) w 
^ P ^ + 2; p ...(p_ i + i) ^ J 

M P Y' a -i-u p - k - P+ j( IP \(p-3)--(p-k-j + l) P -i- k -j 
15 F \2k + 2J p---(p-k) yF 1 



136 P y\ 3 A-l-k-k-p+jf 2 P \ (p - j) ■ ■ ■ (p - k - j + 1) , , sp-l-fe-j 

"15 ^ ^ P U + 2J P ---(p-k) U> 1 *J 

< ™ P S^" A -i-k v -k-P+j ( 2p \ pk pP-i-ifc-j 
" 15 ^ ^ V2fc + 2y P ---(p-fc)^ 

136 P «^ J j_ fc _2k-if 2p \ 1 

"15 ^ P ^2fc + 2jp-..(p-fc) 

136 t _ fc p- 2 "- 1 2p(2p - 1) ■ ■ ■ (2p - 2fc - 1) 

15 ^ (2fc + 2)! p---(p-k) 

136 ^ p- 2 *-^ 2 p(p - 1/2) ■ ■ ■ (p - fc - 1/2) 

15 (2k + 2)\ p---{p-k) 

136 P ^ 3 ._ x _ k 2 2fc + 2 



< 15 £ " ( 2fc + 2 ) ! 

136 ^ (2/ % /I) 2fc + 2 136 r , , , r-r. , 

< T5- £ 2fc + 2)! = [ C ° Sh(2/VI) - 1] < ^ ^ < 20. 

k=0 y ' 

We thus get the desired result

$$\mathbb{E}A_n^p \leq (3\|\theta_0 - \theta_*\|^2 + 20np\gamma^2R^2)^p.$$

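D Proof of Prop. 4

The proof is organized in two parts: we first show a bound on $\frac{1}{n}\sum_{k=1}^n f'(\theta_{k-1})$, then relate it to
$f'\big(\frac{1}{n}\sum_{k=1}^n\theta_{k-1}\big)$ using self-concordance.

D.1 Bound on $\frac{1}{n}\sum_{k=1}^n f'(\theta_{k-1})$

We have, following [16]:

$$f_n'(\theta_{n-1}) = \frac{1}{\gamma}(\theta_{n-1} - \theta_n),$$

which implies, by summing over all integers between 1 and $n$:

$$\frac{1}{n}\sum_{k=1}^n f'(\theta_{k-1}) = \frac{1}{n}\sum_{k=1}^n\big[f'(\theta_{k-1}) - f_k'(\theta_{k-1})\big] + \frac{1}{\gamma n}(\theta_0 - \theta_*) + \frac{1}{\gamma n}(\theta_* - \theta_n).$$

We denote $X_k = \frac{1}{n}[f'(\theta_{k-1}) - f_k'(\theta_{k-1})]$. We have: $\|X_k\| \leq \frac{2R}{n}$ and $\mathbb{E}(X_k \mid \mathcal{F}_{k-1}) = 0$, with
$\big(\sum_{k=1}^n\mathbb{E}(\|X_k\|^2 \mid \mathcal{F}_{k-1})\big)^{1/2} \leq \frac{2R}{\sqrt{n}}$. We may thus apply the Burkholder-Rosenthal-Pinelis
inequality (Theorem 4.1 of the cited reference), and get:

$$\Big(\mathbb{E}\Big\|\frac{1}{n}\sum_{k=1}^n\big[f'(\theta_{k-1}) - f_k'(\theta_{k-1})\big]\Big\|^{2p}\Big)^{1/2p} \leq 2p\frac{2R}{n} + \sqrt{2p}\frac{2R}{\sqrt{n}}.$$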

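This leads to, with $p \leq \lfloor n/4\rfloor$:

$$\Big(\mathbb{E}\Big\|\frac{1}{n}\sum_{k=1}^n f'(\theta_{k-1})\Big\|^{2p}\Big)^{1/2p} \leq 2p\frac{2R}{n} + \sqrt{2p}\frac{2R}{\sqrt{n}} + \frac{1}{\gamma n}\|\theta_0 - \theta_*\| + \frac{1}{\gamma n}\sqrt{\|\theta_0 - \theta_*\|^2 + 18np\gamma^2R^2}$$
$$\leq 2p\frac{2R}{n} + \sqrt{2p}\frac{2R}{\sqrt{n}} + \frac{1}{\gamma n}\|\theta_0 - \theta_*\| + \Big[\frac{1}{\gamma n}\|\theta_0 - \theta_*\| + \frac{1}{\gamma n}\sqrt{18np}\,\gamma R\Big]$$
$$\leq \sqrt{p}\frac{R}{\sqrt{n}}\big[2 + 2\sqrt{2} + \sqrt{18}\big] + \frac{2}{\gamma n}\|\theta_0 - \theta_*\|, \quad\text{using } \sqrt{p} \leq \sqrt{n}/2,$$
$$\leq 10\sqrt{p}\frac{R}{\sqrt{n}} + \frac{2}{\gamma n}\|\theta_0 - \theta_*\|. \tag{13}$$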



D.2 Using self-concordance 

Using the self-concordance property of Lemma HI we obtain 

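\[
\begin{aligned}
&\Big\| \frac{1}{n}\sum_{k=1}^n f'(\theta_{k-1}) - f'\Big( \frac{1}{n}\sum_{k=1}^n \theta_{k-1} \Big) \Big\| \\
&\quad = \Big\| \frac{1}{n}\sum_{k=1}^n \big[ f'(\theta_{k-1}) - f'(\theta_*) - f''(\theta_*)(\theta_{k-1} - \theta_*) \big] - \Big[ f'\Big( \frac{1}{n}\sum_{k=1}^n \theta_{k-1} \Big) - f'(\theta_*) - f''(\theta_*)\Big( \frac{1}{n}\sum_{k=1}^n \theta_{k-1} - \theta_* \Big) \Big] \Big\| \\
&\quad \le \frac{R}{n}\sum_{k=1}^n \big[ f(\theta_{k-1}) - f(\theta_*) - \langle f'(\theta_*), \theta_{k-1} - \theta_* \rangle \big] + R\Big[ f\Big( \frac{1}{n}\sum_{k=1}^n \theta_{k-1} \Big) - f(\theta_*) - \Big\langle f'(\theta_*), \frac{1}{n}\sum_{k=1}^n \theta_{k-1} - \theta_* \Big\rangle \Big] \\
&\quad \le 2R\Big( \frac{1}{n}\sum_{k=1}^n f(\theta_{k-1}) - f(\theta_*) \Big).
\end{aligned}
\]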
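The self-concordance step used in this subsection bounds a first-order Taylor remainder of the gradient by $R$ times a Taylor remainder of the function. As a sanity check, the sketch below tests the inequality $\|f'(\theta_1) - f'(\theta_2) - f''(\theta_2)(\theta_1 - \theta_2)\| \le R\,[f(\theta_1) - f(\theta_2) - \langle f'(\theta_2), \theta_1 - \theta_2 \rangle]$ on an empirical logistic objective. The data, the helper names and the use of an empirical average in place of the expectation are assumptions made for illustration only.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(theta, data):
    """Empirical logistic objective: mean of log(1 + exp(-y <x, theta>))."""
    return sum(math.log1p(math.exp(-y * dot(x, theta))) for x, y in data) / len(data)

def grad(theta, data):
    g = [0.0] * len(theta)
    for x, y in data:
        c = -y * sigmoid(-y * dot(x, theta))
        for j, xj in enumerate(x):
            g[j] += c * xj / len(data)
    return g

def hess_vec(theta, v, data):
    """Hessian-vector product f''(theta) v for the empirical logistic objective."""
    out = [0.0] * len(theta)
    for x, y in data:
        s = sigmoid(y * dot(x, theta))
        c = s * (1.0 - s) * dot(x, v)
        for j, xj in enumerate(x):
            out[j] += c * xj / len(data)
    return out

def remainder_gap(t1, t2, data, R):
    """R * (function remainder) minus the norm of the gradient remainder; expected >= 0."""
    v = [a - b for a, b in zip(t1, t2)]
    g1, g2, hv = grad(t1, data), grad(t2, data), hess_vec(t2, v, data)
    lhs = math.sqrt(sum((a - b - c) ** 2 for a, b, c in zip(g1, g2, hv)))
    rhs = R * (f(t1, data) - f(t2, data) - dot(g2, v))
    return rhs - lhs

random.seed(1)
data = [([random.uniform(-1, 1) for _ in range(3)], random.choice([-1, 1]))
        for _ in range(50)]
R = max(math.sqrt(dot(x, x)) for x, _ in data)  # bound on the norm of the observations
pairs = [([random.uniform(-2, 2) for _ in range(3)],
          [random.uniform(-2, 2) for _ in range(3)]) for _ in range(100)]
min_gap = min(remainder_gap(t1, t2, data, R) for t1, t2 in pairs)
```

In the proof the inequality is applied with $\theta_2 = \theta_*$, where $f'(\theta_*) = 0$, which is why the inner-product terms vanish in the last line of the display.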
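This leads to, using Prop. [2]:

\[
\bigg( \mathbb{E} \Big\| \frac{1}{n}\sum_{k=1}^n f'(\theta_{k-1}) - f'\Big( \frac{1}{n}\sum_{k=1}^n \theta_{k-1} \Big) \Big\|^{2p} \bigg)^{1/2p}
\le 2R \bigg( \mathbb{E} \Big[ \frac{1}{n}\sum_{k=1}^n f(\theta_{k-1}) - f(\theta_*) \Big]^{2p} \bigg)^{1/2p}
\le \frac{R}{\gamma n}\big[ 3\|\theta_0 - \theta_*\|^2 + 40 n p \gamma^2 R^2 \big].
\tag{14}
\]

Summing Eq. (13) and Eq. (14) leads to the desired result.

E Results for small p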



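In Prop. [2] we may replace the bound $3\|\theta_0 - \theta_*\|^2 + 20 n p \gamma^2 R^2$ with a bound with a smaller constant for $p = 1, 2, 3, 4$:

\[
\begin{aligned}
\mathbb{E}\big( 2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2 \big)^2 &\le \big( \|\theta_0 - \theta_*\|^2 + 3 n \gamma^2 R^2 \big)^2, \\
\mathbb{E}\big( 2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2 \big)^3 &\le \big( \|\theta_0 - \theta_*\|^2 + 6 n \gamma^2 R^2 \big)^3, \\
\mathbb{E}\big( 2\gamma n[f(\bar\theta_n) - f(\theta_*)] + \|\theta_n - \theta_*\|^2 \big)^4 &\le \big( \|\theta_0 - \theta_*\|^2 + 9 n \gamma^2 R^2 \big)^4.
\end{aligned}
\]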
This is done using the same proof principle but finer derivations, as follows. We denote $\|\theta_0 - \theta_*\|^2 = a$ and $\gamma^2 R^2 = b$. We have

\[
\begin{aligned}
\mathbb{E}A_n &\le a + nb, \\[4pt]
\mathbb{E}A_n^2 &\le A_{n-1}^2 + 2A_{n-1}b + b^2 + 4bA_{n-1} \\
&\le a^2 + 6b\Big[ na + \frac{n^2}{2}b \Big] + b^2 n \\
&= a^2 + 6bna + b^2(n + 3n^2) \\
&\le (a + 3nb)^2, \\[4pt]
\mathbb{E}A_n^3 &\le (A_{n-1}^3 + 3A_{n-1}^2 b + 3A_{n-1}b^2 + b^3) + 3(A_{n-1} + b)4bA_{n-1} + 8b^{3/2}A_{n-1}^{3/2} \\
&= (A_{n-1}^3 + 3A_{n-1}^2 b + 3A_{n-1}b^2 + b^3) + 3(A_{n-1} + b)4bA_{n-1} + 4bA_{n-1}\big[ 2b^{1/2}A_{n-1}^{1/2} \big] \\
&\le (A_{n-1}^3 + 3A_{n-1}^2 b + 3A_{n-1}b^2 + b^3) + 3(A_{n-1} + b)4bA_{n-1} + 4bA_{n-1}\Big[ \frac{A_{n-1}}{4} + 4b \Big] \\
&= A_{n-1}^3 + A_{n-1}^2 b[3 + 12 + 1] + A_{n-1}b^2[3 + 12 + 16] + b^3 \\
&= A_{n-1}^3 + 16A_{n-1}^2 b + 31A_{n-1}b^2 + b^3 \\
&\le a^3 + 16b\big[ na^2 + 3bn^2 a + b^2(n^2/2 + n^3) \big] + 31b^2\big[ na + bn^2/2 \big] + nb^3 \\
&= a^3 + 16nba^2 + b^2 a[48n^2 + 31n] + b^3\big[ 8n^2 + 16n^3 + \tfrac{31}{2}n^2 + n \big] \\
&= a^3 + 16nba^2 + b^2 a[48n^2 + 31n] + b^3\big[ \tfrac{47}{2}n^2 + 16n^3 + n \big] \\
&\le (a + 6nb)^3, \\[4pt]
\mathbb{E}A_n^4 &\le A_{n-1}^4 + 4A_{n-1}^3 b + 6A_{n-1}^2 b^2 + 4A_{n-1}b^3 + b^4 \\
&\quad + 6\big[ A_{n-1}^2 + 2A_{n-1}b + b^2 \big]4bA_{n-1} + 4[A_{n-1} + b]4bA_{n-1}\big[ 2b^{1/2}A_{n-1}^{1/2} \big] \\
&\le A_{n-1}^4 + 4A_{n-1}^3 b + 6A_{n-1}^2 b^2 + 4A_{n-1}b^3 + b^4 \\
&\quad + 6\big[ A_{n-1}^2 + 2A_{n-1}b + b^2 \big]4bA_{n-1} + 4[A_{n-1} + b]4bA_{n-1}\Big[ \frac{A_{n-1}}{2} + 2b \Big] + 16b^2 A_{n-1}^2 \\
&= A_{n-1}^4 + A_{n-1}^3 b[4 + 24 + 8] + A_{n-1}^2 b^2[6 + 48 + 16 + 8 + 32] + A_{n-1}b^3[4 + 24 + 32] + b^4 \\
&= A_{n-1}^4 + 36A_{n-1}^3 b + 110A_{n-1}^2 b^2 + 60A_{n-1}b^3 + b^4 \\
&\le a^4 + 36b\big[ na^3 + 8n^2 ba^2 + b^2 a(48n^3/3 + 31n^2/2) + b^3(\tfrac{47}{6}n^3 + 4n^4 + n^2/2) \big] \\
&\quad + 110b^2\big[ na^2 + 3bn^2 a + b^2(n^2/2 + n^3) \big] + nb^4 + 60b^3\big[ na + bn^2/2 \big] \\
&\le a^4 + 36bna^3 + b^2 n^2 a^2[36 \times 8 + 110] + b^3 n^3 a[36 \times 48/3 + 36 \times 31/2 + 330 + 60] \\
&\quad + b^4 n^4[6 \times 47 + 36 \times 4 + 18 + 55 + 110 + 1 + 30] \\
&\le (a + 9nb)^4.
\end{aligned}
\]
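The final step of each case in this appendix compares an expanded polynomial in $a$ and $nb$ with a power $(a + c\,nb)^p$. A coefficient-wise comparison is enough once lower powers of $n$ are bounded by the highest one (using $n \ge 1$). The sketch below does this bookkeeping with exact rational arithmetic; the helper name `binomial_power_coeffs` and the dictionary layout are illustrative, and the listed coefficients are read off the expanded bounds for $p = 2, 3, 4$.

```python
from fractions import Fraction
from math import comb

def binomial_power_coeffs(c, p):
    """Coefficients of a^(p-j) * (n*b)^j in (a + c*n*b)^p, for j = 0..p."""
    return [comb(p, j) * c ** j for j in range(p + 1)]

# Upper bounds on the coefficient of a^(p-j) * (n*b)^j in each expanded moment bound,
# obtained by replacing n^i with n^j for i <= j (valid for n >= 1).
expanded = {
    2: [1, 6, 1 + 3],
    3: [1, 16, 48 + 31, Fraction(47, 2) + 16 + 1],
    4: [1, 36, 36 * 8 + 110,
        Fraction(36 * 48, 3) + Fraction(36 * 31, 2) + 330 + 60,
        6 * 47 + 36 * 4 + 18 + 55 + 110 + 1 + 30],
}
# Constants c in the closed forms (a + 3nb)^2, (a + 6nb)^3, (a + 9nb)^4.
closed_form_constant = {2: 3, 3: 6, 4: 9}

dominated = {
    p: all(e <= t for e, t in zip(expanded[p],
                                  binomial_power_coeffs(closed_form_constant[p], p)))
    for p in expanded
}
```

Every entry of `dominated` is `True`, and the margins are comfortable in the higher-order terms, so the constants 3, 6 and 9 are convenient rather than tight.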

F Proof of Prop. [6] 

The proof follows from Prop. [5] applied to $\bar\theta_n$. We thus need to provide a control on the probability that $\|f'(\bar\theta_n)\| \ge \frac{3\mu}{4R}$.

F.1 Tail bound for $\|f'(\bar\theta_n)\|$

We derive a large deviation bound, as a consequence of Prop. 0] and Lemma 



< 4exp(-i), 



which is valid as long as t ^ n (condition from Lemma [5]). It is valid for all t, because for all 
gradients are bounded by R. 



F.2 Bounding the function values 



From Prop. [5], if $\|f'(\bar\theta_n)\| \le \frac{3\mu}{4R}$, then $f(\bar\theta_n) - f(\theta_*) \le \frac{2}{\mu}\|f'(\bar\theta_n)\|^2$. This will allow us to derive a tail bound for $f(\bar\theta_n) - f(\theta_*)$, for sufficiently small deviations. For larger deviations, we will use Prop. [2].
We consider the event



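\[ \mathcal{A}_t = \bigg\{ \|f'(\bar\theta_n)\| \le \frac{2R}{\sqrt{n}}\Big[ 10\sqrt{t} + 40R^2\gamma t\sqrt{n} + \frac{3}{\gamma\sqrt{n}}\|\theta_0 - \theta_*\|^2 + \frac{2}{\gamma R\sqrt{n}}\|\theta_0 - \theta_*\| \Big] \bigg\}. \]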



If we have: 



and 



10Vt + 40i? 2 7 t^ *S l^Vn^f^n 
' V 3AR2R 4R 2 

1 3/j, y 7 "- Mv 7 "- 



7V n 

then, by Prop. [51 we have: 
At c { /(0„) - /(0*) < 



7Ry/n 



I0n-0.ll < r^-b 



3 4R 2i? 8i? 2 ' 



2 r 



8i? 
/Ltn 
^ Si? 2 
/in 



lOVt + 40i? 2 7Vn ■ 



lOVt + 2(0 + A 



7V n 



with $\square = 2\gamma R^2\sqrt{n}$ and $\Delta = \frac{3}{\gamma\sqrt{n}}\|\theta_0 - \theta_*\|^2 + \frac{2}{\gamma R\sqrt{n}}\|\theta_0 - \theta_*\|$.
This implies that for all t ^ 0, such that 10\/i + 20Di 



f(0 n ) - m) > — 



lOVi + 2(0 + A 



jRy/n 



< 4e~ 



Moreover, we have for all t ^ (from Prop. [2]) 

f(0n)-f(0*) ^ 30 7 i?' 2 t + 



9 3 0n — 0* i 

2j - • 11 ^- < 2exp(-i). 

7 n 



We may now use the last two inequalities to bound the expectation $\mathbb{E}[f(\bar\theta_n) - f(\theta_*)]$.
We have:

E[/(0„)-/(0*)] = 



For 



+ 00 



A 



f(9*)>u]du + 



J" \ 4JJ 2 T / 

A 2 8R 2 
fin 

du - 



/A 2 8fi 

'[/(An) d« 



W») -/(**) >u]du 



Ae~ l d 



[in 



lOVt + 20Dt + A 



+2 



A 



fj,n \ 4R- 

,8R 2 32R 2 I" 00 



exp 



)du using the two tail bounds, 



n/i [in 
+6O7E 2 exp 



(^ +A ) 2 _ JL||eo _ ( , t||2 v 30 7 i? 2 ' 

^ e _t ^100 + 400D 2 2t + 400D^ 1/2 + 20 A 1/2 + 40AD^dt 
4i? 2 /V^ 



30 7 i? 2 



■(^ + A) 2 -A|| 0O 
/in 4it z 771 



^ A 



,8i? 2 32i? 2 



n\x [in \ 
-6O7.R 2 exp 



/ 3 
100r(l) + 400D 2 2r(2) + 400D-r(3/2) + 20A-r(l/2) + 40Anr(i; 



1 



4 J? 2 , /iy/fi ,2 fJ_ 

fm [ AR 2 > 8R 2 



30 7j R 2 

with r denoting the Gamma function 
,8R 2 32i? 2 



using — ||0 O -0,|| 2 < 



771 



8R 2 ' 



nfi [in 



3 1 



s$ A 2 + 100 + 400D 2 2 + 400D — + 2OA-0F + 40AD 



+607i? z exp - 



,8R 2 32R 2 



30 7 i? 2 



8R 2 



nfi [in 



22 



3 1 



A 2 + 100 + 400D 2 2 + 400D--\A : +20A-\/^ : + 40An 



22 



+6O7.R 2 — 30 x 87.R 4 using e~ Q s$ — for all a > 0, 
2[i 2a 

32R 2 
n[i 



\& 2 + 100 + 800D 2 + 532D + 40AD + 57D 2 



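For $\gamma = \frac{1}{2R^2\sqrt{N}}$, with $\alpha = R\|\theta_0 - \theta_*\|$, $\square = 1$ and $\Delta = 6\alpha^2 + 4\alpha$, we get

\[
\begin{aligned}
\mathbb{E}[f(\bar\theta_N) - f(\theta_*)]
&\le \frac{32R^2}{N\mu}\Big[ \frac{1}{4}\Delta^2 + 1489 + 40\Delta \Big] \\
&= \frac{32R^2}{N\mu}\big[ 9\alpha^4 + 12\alpha^3 + 4\alpha^2 + 1489 + 240\alpha^2 + 160\alpha \big] \\
&\le \frac{R^2}{N\mu}(5\alpha + 15)^4.
\end{aligned}
\]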
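The closing inequality of this computation, $32\,[9\alpha^4 + 12\alpha^3 + 4\alpha^2 + 1489 + 240\alpha^2 + 160\alpha] \le (5\alpha + 15)^4$ for $\alpha \ge 0$, can be confirmed by comparing coefficients of each power of $\alpha$. The sketch below is a minimal check; the helper name `power_coeffs` is an illustrative choice.

```python
from math import comb

def power_coeffs(c1, c0, p):
    """Coefficients of alpha^j, j = 0..p, in (c1 * alpha + c0)^p."""
    return [comb(p, j) * c1 ** j * c0 ** (p - j) for j in range(p + 1)]

# 32 * [9 a^4 + 12 a^3 + (4 + 240) a^2 + 160 a + 1489], lowest degree first.
lhs = [32 * c for c in (1489, 160, 4 + 240, 12, 9)]
rhs = power_coeffs(5, 15, 4)

# With alpha >= 0, coefficient-wise domination implies the polynomial inequality.
coefficientwise_ok = all(l <= r for l, r in zip(lhs, rhs))
```

The constant term is the tightest comparison (47648 against 50625), which suggests the offset 15 was chosen to absorb the numerical constant 1489.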



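Note that the previous bound is valid if $\frac{3}{\gamma\sqrt{n}}\|\theta_0 - \theta_*\|^2 + \frac{2}{\gamma R\sqrt{n}}\|\theta_0 - \theta_*\| \le \frac{\mu\sqrt{n}}{8R^2}$, i.e., under the
condition $6R^2\|\theta_0 - \theta_*\|^2 + 4R\|\theta_0 - \theta_*\| \le \frac{\mu\sqrt{N}}{8R^2}$. If the condition is not satisfied, then the bound is
still valid because of Prop. [1]. We thus obtain the desired result.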
F.3 Bound on iterates



Following the same principle as for function values, we have: 



El 



16 H / Mv^ 



16.R 2 / 



16.fi 2 ( ("A , aV 
/j 2 n \ 4i? 2 I 



P[||0 n 



P[||0 n 



'[11^ 



+ / P ||0n-0*|| ^« ^ 

' 16fi 2 ( tjy/n I a |" 



I ^ uUti- 



II 



16J? 2 / wVn 



+ A 



,|| 2 > 



2 ^ 16fl 2 ^V» | 



H 2 n y 4i? 2 
using Cauchy-Schwarz inequality. 



1/2 



E(||0„-0*|| 4 ) 



1/2 



16R 2 I H%A? 



A 



16R 2 s 2 
^ 2 ?i Mi? 2 ; 



1/2 



o - 6»*|| 2 + 3j 2 nR 2 using Prop.H 



Moreover, if we denote by t the largest solution of lOy^i + 20D£ = we have: 



Vt 



-10 + A/ioo + 2on^ -io + Wi + 2on^ 



40D 



4CO 



as soon as 20D^| > 100, since if g > 100, -1 + vTTg < 



20 



This leads to: 
E\\6 n -64 2 



i\\d n -0*\\ 2 ^U]du 



.ill 
I A 2 

,2„ D 2 



\9 n - || 2 ^ uldit 



+2exp(- V||0 o -^+3 7 w) ^^(^^^ using Prop.0 



16i? 2 
n\i 2 



4e _t d 



/i 2 n 



lOVt + 20Di + A 



2i 2 



3j 2 nR 2 I using cxp(— a) ^ for all a > 0, 



16a 2 



16i? 2 64i? 



2 f°° / 3 \ 

y e _t ( 100 + 400D 2 2t + 400D-i 1/2 + 20A-i" 1/2 + AOAD jdt 



n/x 2 /z 2 n 



9 40 4 D 4 100 2 i? 4 r 
"2 9 4 20 2 D 2 /i 2 n 
16i? 2 64i? 2 



< A 2 ^- + ( 100r(l) + 400D 2 2r(2) + 400D^r(3/2) + 20A^r(l/2) + 40ADr(l) 
n/j z /j, z n \ 2 2 



+686 x 64- 



,D 2 i? 2 
H 2 n 



2tf + lA 



^ 2 16R_ + MFP ^ 1Qo + 400D 2 2 + 400D — + 20A^V^ + 40AD 



n/i 2 /i 2 n 

2z?2 



+686 x 64 



n 2 R 

IJ?n 



-a 2 + -A 

4 2 



64R 2 



1 



- A z + 100 + 800D Z + 532D + 32A + 40AD + 686-D 4 + 686 



3 n4 



AD 2 



For 7 = 2R z^ , with a = i?||0 o - □ = 1 and A = 2a 2 + 4a, we get 

8i? 2 " 



E\\e 



N 



N/i 2 
8R 2 
Np 2 
8R 2 
N^ 2 
R 2 
Np 2 



2A 2 + 8A(32 + 40 + 343) + 8(100 + 800 + 532 + 515) 
2A 2 + 3320A + 15576 
8a 4 + 16a 3 + 32a 2 + 3320 x 2a 2 + 3320 x 4a + 15576 
(5a + 20) 4 . 



The previous bound is valid as long as ^ 10 2 ° 00 = 500. If it is not satisfied, then Prop. [T]shows 
that it is still valid. 



References 

[1] M. N. Broadie, D. M. Cicek, and A. Zeevi. General bounds and finite-time improvement for 
stochastic approximation algorithms. Technical report, Columbia University, 2009. 

[2] H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and
applications. Springer-Verlag, second edition, 2003.






[3] O. Yu. Kul'chitskii and A. E. Mozgovoi. An estimate for the rate of convergence of recurrent
robust identification algorithms. Kibernet. i Vychisl. Tekhn., 89:36-39, 1991. 

[4] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM 
Journal on Control and Optimization, 30(4):838-855, 1992. 

[5] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical 
Report 781, Cornell University Operations Research and Industrial Engineering, 1988. 

[6] V. Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical 
Statistics, 39(4):1327-1332, 1968. 

[7] Y. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. Automatica, 
44(6):1559-1568, 2008. 

[8] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach 
to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009. 

[9] L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models 
in Business and Industry, 21(2):137-151, 2005. 

[10] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008. 

[11] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size.
In Proc. ICML, 2008. 

[12] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for 
svm. In Proc. ICML, 2007. 

[13] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. 
In Proc. COLT, 2009.

[14] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. 
Journal of Machine Learning Research, 9:2543-2596, 2010. 

[15] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. 
Journal of Machine Learning Research, 10:2899-2934, 2009. 

[16] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for 
machine learning. In Adv. NIPS, 2011. 

[17] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for 
stochastic strongly-convex optimization. In Proc. COLT, 2011.

[18] F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384-
414, 2010. 

[19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proc. ICML, 2001. 

[20] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. 
Wiley & Sons, 1983. 

[21] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower 
bounds on the oracle complexity of stochastic convex optimization. Information Theory, IEEE 
Transactions on, 58(5):3235-3249, 2012. 






[22] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 
133(1-2):365-397, 2012.

[23] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t)
convergence rate for projected stochastic subgradient descent. Technical Report 1212.2002, 
ArXiv, 2012. 

[24] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential 
convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012. 

[25] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic
Publishers, 2004. 

[26] A. Juditsky and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex 
functions. Technical Report 00508933, HAL, 2010. 

[27] S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex program- 
ming algorithms. In Adv. NIPS, 2009. 

[28] H. B. McMahan and M. Streeter. Open problem: Better bounds for online logistic regression. 
In COLT/ICML Joint Open Problem Session, 2012. 

[29] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals
of Probability, 22(4):1679-1706, 1994.






